AllLife Bank - Credit Card Customer Segmentation

General Overview¶

Background & Context¶

AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that penetration in the market can be improved. Based on this input, the Marketing team proposes to run personalized campaigns to target new customers as well as upsell to existing ones. Another insight from the market research was that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customer queries are resolved faster.

Objective¶

  • Explore and visualize the dataset.
  • Identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank, using clustering algorithms.
  • Generate a set of insights and recommendations that will help the bank to:

    ◎ Personalize Marketing campaigns to target both new and existing customers.

    ◎ Upgrade the service delivery model, to ensure that customer queries are resolved faster.

Key Questions:¶

  1. How many different segments of customers are there?
  2. How are these segments different from each other?
  3. What are your recommendations to the bank on how to better market to and service these customers?

Data Dictionary:¶

  • Sl_No: Primary key of the records
  • Customer Key: Customer identification number
  • Average Credit Limit: Average credit limit of each customer for all credit cards
  • Total credit cards: Total number of credit cards possessed by the customer
  • Total visits bank: Total number of visits that customer made (yearly) personally to the bank
  • Total visits online: Total number of visits or online logins made by the customer (yearly)
  • Total calls made: Total number of calls made by the customer to the bank or its customer service department (yearly)

Sanity Checks¶

Importing Necessary Libraries¶

In [1]:
#
# Loading Necessary Libraries
#

# To help with reading and manipulating data
import pandas as pd
import numpy as np

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import animation

sns.set(
    color_codes=True
)  # -----This adds a background color to all the plots created using seaborn

# To display objects in the notebook via IPython
from IPython.display import display

# Import tabulate, a library used for creating tables in a visually appealing format
from tabulate import tabulate

# to scale the data using z-score
from sklearn.preprocessing import StandardScaler

# to compute distances
from scipy.spatial.distance import cdist, pdist

# to perform k-means clustering and compute silhouette scores
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# to visualize the elbow curve and silhouette scores
from yellowbrick.cluster import (
    KElbowVisualizer,
    SilhouetteVisualizer,
    InterclusterDistance,
)


# to perform hierarchical clustering, compute cophenetic correlation, and create dendrograms
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage, cophenet

# For Dimensionality Reduction
from sklearn.decomposition import PCA

# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)

# To suppress scientific notation for dataframes
pd.set_option("display.float_format", lambda x: "%.3f" % x)

# To suppress warnings
import warnings

warnings.filterwarnings("ignore")

# Making the Python code more structured automatically
%load_ext nb_black

print("Loading Libraries... Done.")
Loading Libraries... Done.

Loading the Dataset¶

In [2]:
# Loading Dataset
data_path = "Credit+Card+Customer+Data.xlsx"
data = pd.read_excel(data_path)

# Making a copy of the data to avoid any changes to original data
df = data.copy()

print("Loading Dataset... Done.")
Loading Dataset... Done.

Data Overview¶

Checking a few rows of the Dataset¶

In [3]:
# Checking the top 5 and bottom 5

display(df.head())  # -----looking at head (top 5 observations)
display(df.tail())  # -----looking at tail (bottom 5 observations)
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1 87073 100000 2 1 1 0
1 2 38414 50000 3 0 10 9
2 3 17341 50000 7 1 3 4
3 4 40496 30000 5 1 1 4
4 5 47437 100000 6 0 12 3
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
655 656 51108 99000 10 1 10 0
656 657 60732 84000 10 1 13 2
657 658 53834 145000 8 1 9 1
658 659 80655 172000 10 1 15 0
659 660 80150 167000 9 0 12 2

Observations

  • We will be dropping Sl_No and Customer Key columns.

Checking the shape of the dataset¶

In [4]:
# -----Print the dimension of the data
print(f"There are {df.shape[0]} rows and {df.shape[1]} columns")
There are 660 rows and 7 columns

Checking the Data Types & General Information of the Dataset¶

In [5]:
# -----Displaying information about features of the Dataset
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 660 entries, 0 to 659
Data columns (total 7 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Sl_No                660 non-null    int64
 1   Customer Key         660 non-null    int64
 2   Avg_Credit_Limit     660 non-null    int64
 3   Total_Credit_Cards   660 non-null    int64
 4   Total_visits_bank    660 non-null    int64
 5   Total_visits_online  660 non-null    int64
 6   Total_calls_made     660 non-null    int64
dtypes: int64(7)
memory usage: 36.2 KB

Observations

  • There are no missing values.
  • All the Columns are of type Integer.

Getting the Statistical Summary for the Dataset¶

For Numerical Variables

In [6]:
# -----Displaying Statistical Summary of Numerical Data
df.describe().T
Out[6]:
count mean std min 25% 50% 75% max
Sl_No 660.000 330.500 190.670 1.000 165.750 330.500 495.250 660.000
Customer Key 660.000 55141.444 25627.772 11265.000 33825.250 53874.500 77202.500 99843.000
Avg_Credit_Limit 660.000 34574.242 37625.488 3000.000 10000.000 18000.000 48000.000 200000.000
Total_Credit_Cards 660.000 4.706 2.168 1.000 3.000 5.000 6.000 10.000
Total_visits_bank 660.000 2.403 1.632 0.000 1.000 2.000 4.000 5.000
Total_visits_online 660.000 2.606 2.936 0.000 1.000 2.000 4.000 15.000
Total_calls_made 660.000 3.583 2.865 0.000 1.000 3.000 5.000 10.000

Observations

  • The average Avg_Credit_Limit of a customer is about 35,000.
  • The average Total_Credit_Cards is about 5.
  • There are customers that do not visit the bank (minimum Total_visits_bank is zero).
  • There are customers that do not go online (minimum Total_visits_online is zero).
  • There are customers that do not call the bank at all (minimum Total_calls_made is zero).

Checking Missing, Duplicate & Unique Values¶

Checking for Missing Values¶

In [7]:
# Checking missing values across each columns

c_missing = pd.Series(df.isnull().sum(), name="Missing Count")  # -----Count Missing

p_missing = pd.Series(
    round(df.isnull().sum() / df.shape[0] * 100, 2), name="% Missing"
)  # -----Percentage Missing


# Combine into 1 Dataframe
missing_df = pd.concat([c_missing, p_missing], axis=1)

# # Display missing info
# display(missing_df)

missing_df.sort_values(by="% Missing", ascending=False).style.background_gradient(
    cmap="YlOrRd"
)
Out[7]:
  Missing Count % Missing
Sl_No 0 0.000000
Customer Key 0 0.000000
Avg_Credit_Limit 0 0.000000
Total_Credit_Cards 0 0.000000
Total_visits_bank 0 0.000000
Total_visits_online 0 0.000000
Total_calls_made 0 0.000000

Observations

  • There are no missing values.

Checking for Duplicate Records¶

In [8]:
# Checking for duplicate records

df.duplicated().sum()
Out[8]:
0

Observations

  • There are no duplicate records.

Checking for Duplicate Value Entries¶

In [9]:
# checking the number of unique values in each column
df.nunique()
Out[9]:
Sl_No                  660
Customer Key           655
Avg_Credit_Limit       110
Total_Credit_Cards      10
Total_visits_bank        6
Total_visits_online     16
Total_calls_made        11
dtype: int64
In [10]:
# getting the count for each unique value in Customer Key
filtered = df.groupby("Customer Key").count()

for i in filtered.loc[filtered.Sl_No >= 2].index:
    display(df.loc[df["Customer Key"] == i])
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
48 49 37252 6000 4 0 2 8
432 433 37252 59000 6 2 1 2
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
4 5 47437 100000 6 0 12 3
332 333 47437 17000 7 3 1 0
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
411 412 50706 44000 4 5 0 2
541 542 50706 60000 7 5 2 2
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
391 392 96929 13000 4 5 0 0
398 399 96929 67000 6 2 2 2
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
104 105 97935 17000 2 1 2 10
632 633 97935 187000 7 1 7 0

Observations

  • There are 5 duplicate entries for Customer Key.
  • These duplicate Customer Keys are likely updated records for the same customer.
  • After clustering, we will analyze the groups corresponding to these sets of records and make more informed inferences.
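The same duplicate-key check can be written more directly with `duplicated(keep=False)`, which flags every record sharing a key so both copies surface together. A minimal sketch on synthetic values (not the bank's data):

```python
import pandas as pd

# A toy frame with one repeated key (synthetic values, not from the dataset)
toy = pd.DataFrame(
    {
        "Customer Key": [101, 102, 101, 103],
        "Avg_Credit_Limit": [5000, 8000, 12000, 7000],
    }
)

# keep=False flags every row whose key repeats, so all copies are returned
dupes = toy[toy.duplicated("Customer Key", keep=False)]
print(sorted(dupes["Customer Key"].tolist()))  # → [101, 101]
```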

Exploratory Data Analysis¶

User Defined Functions

In [11]:
# -----
# User defined function to plot labeled_barplot
# -----


def labeled_barplot(data, feature, perc=False, v_ticks=True, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    v_ticks: whether to rotate x-axis tick labels vertically (default is True)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    if v_ticks is True:
        plt.xticks(rotation=90)

    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot
In [12]:
# -----
# User defined function to print the 5 point summary and histogram, box plot,
#   and cumulative density distribution plots
# -----


def summary(data, x):
    """
    The function prints the 5 point summary and histogram, box plot,
    and cumulative density distribution plots for each
    feature name passed as the argument.

    Parameters:
    ----------

    data: dataframe
    x: str, feature name

    Usage:
    ------------

    summary(df, 'age')
    """

    x_min = data[x].min()
    x_max = data[x].max()
    Q1 = data[x].quantile(0.25)
    Q2 = data[x].quantile(0.50)
    Q3 = data[x].quantile(0.75)

    dict = {"Min": x_min, "Q1": Q1, "Q2": Q2, "Q3": Q3, "Max": x_max}
    ldf = pd.DataFrame(data=dict, index=["Value"])
    print(f"5 Point Summary of {x.capitalize()} Attribute:\n")
    print(tabulate(ldf, headers="keys", tablefmt="psql"))

    fig, axs = plt.subplots(nrows=3, ncols=1, figsize=(16, 22))
    sns.set_palette("Pastel1")

    # Histogram
    ax1 = sns.distplot(data[x], color="purple", ax=axs[0])
    ax1.axvline(np.mean(data[x]), color="purple", linestyle="--")
    ax1.axvline(np.median(data[x]), color="black", linestyle="-")
    ax1.set_title(f"{x.capitalize()} Density Distribution")

    # Boxplot
    ax2 = sns.boxplot(
        x=data[x], palette="cool", width=0.7, linewidth=0.6, showmeans=True, ax=axs[1]
    )
    ax2.set_title(f"{x.capitalize()} Boxplot")

    # Cummulative plot
    ax3 = sns.kdeplot(data[x], cumulative=True, linewidth=1.5, ax=axs[2])
    ax3.set_title(f"{x.capitalize()} Cumulative Density Distribution")

    plt.subplots_adjust(hspace=0.4)
    plt.show()

Univariate analysis¶

In [13]:
# Loop through all the variables of interest and perform a univariate statistical summary
for col in df.columns[2:]:
    print("=" * 115, "\n\n")
    summary(df, col)
=================================================================================================================== 


5 Point Summary of Avg_credit_limit Attribute:

+-------+-------+-------+-------+-------+--------+
|       |   Min |    Q1 |    Q2 |    Q3 |    Max |
|-------+-------+-------+-------+-------+--------|
| Value |  3000 | 10000 | 18000 | 48000 | 200000 |
+-------+-------+-------+-------+-------+--------+
=================================================================================================================== 


5 Point Summary of Total_credit_cards Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     1 |    3 |    5 |    6 |    10 |
+-------+-------+------+------+------+-------+
=================================================================================================================== 


5 Point Summary of Total_visits_bank Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    4 |     5 |
+-------+-------+------+------+------+-------+
=================================================================================================================== 


5 Point Summary of Total_visits_online Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    2 |    4 |    15 |
+-------+-------+------+------+------+-------+
=================================================================================================================== 


5 Point Summary of Total_calls_made Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    1 |    3 |    5 |    10 |
+-------+-------+------+------+------+-------+

Observations

  • Avg_credit_limit is skewed to the right.
  • The mean of Avg_credit_limit is greater than the median, and there are outliers to the right.
  • About 95% of the Customers have an Avg_credit_limit of 150,000 or less.
  • The minimum Total_credit_cards is 1 and the maximum is 10.
  • About 95% of the Customers have 8 Total_credit_cards or fewer.
  • There are Customers with 0 Total_visits_bank, i.e., they do not visit the bank at all. It will be interesting to see if they form a cluster.
  • There are Customers with 0 Total_visits_online, i.e., they do not use online banking at all. It will be interesting to see if they form a cluster.
  • Total_visits_online is skewed to the right and has outliers.
  • There are Customers with 0 Total_calls_made, i.e., they do not call the bank at all. It will be interesting to see if they form a cluster.
  • Total_calls_made is slightly skewed to the right, but there are no outliers.
In [14]:
for col in df.columns[3:]:
    labeled_barplot(df, col, perc=True)

Observations

  • About 67% of the Customers have between 4 and 7 cards.
  • About 9% of the Customers have only 1 card, and about 3% have 10 cards.
  • Customers with 0 Total_visits_bank form about 15% of the Customer base.
  • About 40% of the Customers visit the bank 2-3 times.
  • Customers with 0 Total_visits_online form about 21% of the Customer base.
  • About 45% of the Customers go online 1-2 times.
  • Customers with 0 Total_calls_made form about 15% of the Customer base.
  • About 33% of the Customers call the bank 2-4 times.

Bivariate analysis¶

Correlation Matrix¶

In [15]:
# Define the Correlation Matrix
cm = df.iloc[:, 2:]

# Display the matrix
plt.figure(figsize=(15, 7))
sns.heatmap(cm.corr(), annot=True, vmin=-1, vmax=1, fmt=".2f", cmap="Spectral")
plt.show()
In [16]:
# Create a pairplot

sns.pairplot(data=cm, diag_kind="kde", plot_kws={"color": "purple"})

plt.show()

Observations

  • Total_Credit_Cards and Avg_Credit_Limit are positively correlated, which is to be expected.
  • Avg_Credit_Limit and Total_visits_online are positively correlated, implying that online banking customers tend to have higher credit limits.
  • Total_visits_bank is negatively correlated with Total_visits_online and Total_calls_made. This is to be expected.
  • Avg_Credit_Limit and Total_calls_made are negatively correlated, meaning that Customers who use telephone banking tend to have lower credit limits.
In [17]:
sns.pairplot(
    data=df[
        [
            "Total_visits_bank",
            "Total_visits_online",
            "Total_calls_made",
            "Total_Credit_Cards",
        ]
    ],
    hue="Total_Credit_Cards",
)

plt.show()

Observations

  • We see some form of clustering happening. Let's explore this further in a 3D plot.
In [18]:
# Create a 3D Plot that shows the relationship of how the customers connect with the bank

# Create figure and axis
fig = plt.figure(figsize=(10, 10))
ax = plt.axes(projection="3d")

# Initialize data
x = df["Total_visits_bank"]
y = df["Total_visits_online"]
z = df["Total_calls_made"]

# Create a 3D scatter plot
ax.scatter(x, y, z, c=z, cmap="viridis")

# Set labels and title
ax.set_xlabel("Total_visits_bank")
ax.set_ylabel("Total_visits_online")
ax.set_zlabel("Total_calls_made")
ax.set_title("3D Scatter Plot")

# Set the azimuthal angle
ax.view_init(azim=60)

# Display the plot
plt.show()
In [19]:
########################################################################
# The code commented out below was used to generate the gif of the 3D plot.
# The gif was embedded as a markdown image independently.
########################################################################


########################################################################
# fig = plt.figure(figsize=(10, 10))

# ax = Axes3D(fig)


# # Initialize data
# x = df["Total_visits_bank"]
# y = df["Total_visits_online"]
# z = df["Total_calls_made"]


# g = ax.scatter(x, y, z, c=x, marker="o", depthshade=False, cmap="Paired")
# ax.set_xlabel("Total Bank Visits")
# ax.set_ylabel("Total Visits Online")
# ax.set_zlabel("Total Calls Made")

# # produce a legend with the unique colors from the scatter
# legend = ax.legend(
#     *g.legend_elements(),
#     loc="lower center",
#     title="Total bank visits",
#     borderaxespad=-10,
#     ncol=4
# )
# ax.add_artist(legend)

# def rotate(angle):
#     ax.view_init(azim=angle)


# angle = 1
# ani = animation.FuncAnimation(fig, rotate, frames=np.arange(0, 360, angle), interval=1)
# ani.save("Cluster_plot.gif", writer=animation.PillowWriter(fps=25))
########################################################################

Cluster_plot.gif

Insights Based on EDA¶

  • In addition to all the observations above, we can clearly see 3 clusters:

    ◎ Cluster 1: High Online-High Credit-Low Calls-Low BanksVisit Customers.

    ◎ Cluster 2: High Calls-Mid Credit-Low Online-Low BanksVisit Customers.

    ◎ Cluster 3: High BanksVisit-Low Credit-Low Online-Low Calls Customers.

Data Preprocessing¶

In [20]:
# Keep a copy of all columns in df2, then drop Sl_No & Customer Key from df
df2 = df.copy()
df.drop(["Sl_No", "Customer Key"], axis=1, inplace=True)

Scaling¶

In [21]:
# Create an instance of StandardScaler
scaler = StandardScaler()

# Fit and transform the DataFrame
scaled_df = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
In [22]:
display(scaled_df.head())
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made
0 1.740 -1.249 -0.860 -0.547 -1.252
1 0.410 -0.788 -1.474 2.521 1.892
2 0.410 1.059 -0.860 0.134 0.146
3 -0.122 0.136 -0.860 -0.547 0.146
4 1.740 0.597 -1.474 3.202 -0.204

Observations

  • The data has been scaled.
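As a quick sanity check on what StandardScaler does, the z-score transform (x - mean) / std should leave each column with mean 0 and unit variance. A minimal sketch on synthetic values (not the bank's data):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic two-feature sample (illustrative values only)
X = np.array([[3000.0, 2.0], [50000.0, 7.0], [100000.0, 10.0]])
scaled = StandardScaler().fit_transform(X)

# After z-scoring, each column has mean ~0 and (population) std ~1
print(np.allclose(scaled.mean(axis=0), 0.0))  # → True
print(np.allclose(scaled.std(axis=0), 1.0))   # → True
```

Scaling matters here because Avg_Credit_Limit is orders of magnitude larger than the count features and would otherwise dominate the distance computations.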
In [23]:
# Make copies for K-Means and Hierarchical Clustering.
km_df = scaled_df.copy()
hc_df = scaled_df.copy()

K-means Clustering¶

Checking Elbow Plot¶

In [24]:
######
# Code to generate the Inertia Values, and create the Elbow Plot with those values.
######

# List to store cluster and intra cluster distance
clusters = []
# Variance - Intra cluster sum of distance ; within cluster sum of distance
inertia_vals = [] 

# Multiple values of K are utilized to come up with the optimum cluster value
# Cluster number and intra cluster distance is appended for plotting the elbow curve
for k in range(1, 11, 1):

    # train clustering with the specified K
    model = KMeans(n_clusters=k, random_state=1)
    model.fit(km_df)

    # append model to cluster list
    clusters.append(model)
    # append inertia to list
    inertia_vals.append(model.inertia_)
 
print("Inertia Values:")
display(inertia_vals)   # ---Intra cluster distance


# Plot the inertia vs K values
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11, 1), inertia_vals, marker="*")
plt.title("Sum of Intra Cluster Distance across different values of K")

plt.axvline(3, linestyle="--")

plt.xlabel("K - clusters")
plt.ylabel("Inertia (Intra Cluster Distance)")

plt.xticks(range(1, 11, 1))

plt.show()
Inertia Values:
[3300.0,
 2040.989816478495,
 933.0437490000536,
 780.7736895551767,
 704.4759188657506,
 642.4285451423209,
 600.2238778375961,
 574.4418958177624,
 538.8269188945017,
 509.1631378891258]
In [25]:
# Using KElbowVisualizer from YellowBrick
model = KMeans(random_state=1)
visualizer = KElbowVisualizer(model, k=(2, 10), timings=True)
visualizer.fit(km_df)  # fit the data to the visualizer
visualizer.show()  # finalize and render figure
Out[25]:
<AxesSubplot:title={'center':'Distortion Score Elbow for KMeans Clustering'}, xlabel='k', ylabel='distortion score'>

Observations

  • We see a clear elbow at K = 3.
  • Let's check the Silhouette Scores.
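The elbow heuristic relies on inertia being monotonically decreasing in K, with the marginal drop shrinking past the true number of clusters. A minimal sketch on synthetic blobs (illustrative only, not the bank's data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic blobs
X = np.vstack([rng.normal(c, 0.5, size=(40, 2)) for c in (0, 4, 8)])

inertias = [
    KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
    for k in range(1, 6)
]

# Inertia always decreases as K grows; the "elbow" is where the drop flattens
print(all(a > b for a, b in zip(inertias, inertias[1:])))  # → True
```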

Checking Silhouette Score¶

In [26]:
#####
# Code to calculate the Silhouette Scores
#####
silhouette = []

for i in range(1, 10, 1):

    print("~" * 40)
    print(clusters[i])
    print("Silhouette score:", silhouette_score(km_df, clusters[i].predict(km_df)))
    silhouette.append(round(silhouette_score(km_df, clusters[i].predict(km_df)), 3))
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=2, random_state=1)
Silhouette score: 0.41842496663230405
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=3, random_state=1)
Silhouette score: 0.5157182558882754
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=4, random_state=1)
Silhouette score: 0.35566706193773706
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=5, random_state=1)
Silhouette score: 0.2717470361094591
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=6, random_state=1)
Silhouette score: 0.25590676529850875
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=7, random_state=1)
Silhouette score: 0.2479864465613871
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(random_state=1)
Silhouette score: 0.2414240144772954
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=9, random_state=1)
Silhouette score: 0.2184645050766369
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
KMeans(n_clusters=10, random_state=1)
Silhouette score: 0.2140658673097117
In [27]:
#####
# Code to plot the Silhouette Scores
#####
plt.figure(figsize=(10, 5))
plt.plot(range(2, 11, 1), silhouette, marker="*")
plt.title("Sum of Intra Cluster Distance across different values of K")
plt.axvline(3, linestyle="--")

plt.xlabel("K - clusters")
plt.ylabel("Silhouette Score")
plt.xticks(range(2, 11, 1))
plt.show()

Observations

  • The Silhouette Score is highest at K = 3, which is in line with the Elbow Plot.
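The silhouette score peaks at the true number of clusters when the groups are well separated, which is why it corroborates the elbow here. A toy sketch on synthetic blobs (not the bank's data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
# Three tight, well-separated synthetic blobs
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])

scores = {
    k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    )
    for k in (2, 3, 4)
}

# The true number of blobs should score highest
print(max(scores, key=scores.get))  # → 3
```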

Silhouette Visualizer¶

We will create Silhouette Visualizer for K=3, 4 & 5¶

In [28]:
# Silhouette Visualizer for K=3

plt.figure(figsize=(15, 8))
visualizer = SilhouetteVisualizer(KMeans(3, random_state=1))
visualizer.fit(km_df)
visualizer.show()
Out[28]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 3 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [29]:
# Silhouette Visualizer for K=4

plt.figure(figsize=(15, 8))
visualizer = SilhouetteVisualizer(KMeans(4, random_state=1))
visualizer.fit(km_df)
visualizer.show()
Out[29]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 4 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>
In [30]:
# Silhouette Visualizer for K=5

plt.figure(figsize=(15, 8))
visualizer = SilhouetteVisualizer(KMeans(5, random_state=1))
visualizer.fit(km_df)
visualizer.show()
Out[30]:
<AxesSubplot:title={'center':'Silhouette Plot of KMeans Clustering for 660 Samples in 5 Centers'}, xlabel='silhouette coefficient values', ylabel='cluster label'>

Observations

  • The Silhouette Visualizer for K = 3 is the best.
  • K = 4 is unsuitable because the distributions of clusters 1 & 3 are similar; the distributions should be distinct.
  • K = 5 is unsuitable because cluster 3 contains negative silhouette values.

Intercluster Distance Visualizer¶

Checking the Intercluster Distance¶

In [31]:
#####
# Code to visualise the Intercluster Distance
#####

plt.figure(figsize=(15, 8))

# Instantiate the clustering model and visualizer
model = KMeans(3, random_state=1)

visualizer = InterclusterDistance(model, random_state=1)

# Fit the data to the visualizer
visualizer.fit(km_df)

# Finalize and render the figure
visualizer.show()
Out[31]:
<AxesSubplot:title={'center':'KMeans Intercluster Distance Map (via MDS)'}, xlabel='PC2', ylabel='PC1'>

Observations

  • The Clusters are well spaced out and there are no intersections. This is a good thing.

We will proceed with K=3 for K-means Clustering¶

Creating Final Model¶

In [32]:
%%time
kmeans = KMeans(n_clusters=3, random_state=1)
kmeans.fit(km_df)
CPU times: user 122 ms, sys: 39.9 ms, total: 162 ms
Wall time: 37.4 ms
Out[32]:
KMeans(n_clusters=3, random_state=1)
In [33]:
# adding kmeans cluster labels
df["K_means_segments"] = kmeans.labels_
df2["K_means_segments"] = kmeans.labels_
km_df["K_means_segments"] = kmeans.labels_

km_df["K_means_segments"].value_counts()
Out[33]:
1    386
0    224
2     50
Name: K_means_segments, dtype: int64

Observations

  • Our 3 clusters have 386, 224, and 50 members respectively.

PCA - Principal Component Analysis¶

While we do not face the curse of dimensionality with this dataset, we will use PCA to reduce the features from 5 to 2 so that we can easily visualize them.¶

In [34]:
#####
# Dimensionality reduction: convert the dataset from 5 columns to 2,
# combining the original features to create pc1 and pc2
#####

pca = PCA(n_components=2, random_state=1)

principalComponents = pca.fit_transform(km_df)

X_std = pd.DataFrame(data=principalComponents, columns=["pc1", "pc2"])

X_std.head()
Out[34]:
pc1 pc2
0 0.566 0.520
1 -1.594 3.119
2 0.593 0.819
3 -0.247 0.134
4 1.661 3.660
In [35]:
#####
# Code to check the amount of variance explained by the principal components
# (the information captured by the PCs)
#####

variance = pca.explained_variance_ratio_

print("Variance explained by each Principal Component:")

total = 0
j = 1
for i in variance:
    print("PC", j, ":", round(i * 100, 2))
    j += 1
    total += i

# Format the statement with the value of 'total'
statement = (
    "The 2 Principal Components explain {}% of the information in the data".format(
        round(total * 100, 2)
    )
)
# Print the statement
print("\n")
print(statement)
Variance explained by each Principal Component:
PC 1 : 48.54
PC 2 : 35.01


The 2 Principal Components explain 83.55% of the information in the data
In [36]:
# Plot in 2D

X_std["labels"] = kmeans.labels_

plt.figure(figsize=(10, 8))

sns.scatterplot(x=X_std.pc1, y=X_std.pc2, hue=X_std.labels)
Out[36]:
<AxesSubplot:xlabel='pc1', ylabel='pc2'>

Observations

  • Clearly, there are 3 clusters.
  • Cluster 1 has the most datapoints, followed by Cluster 0; Cluster 2 has the fewest datapoints.
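The variance figures behave as expected: when most of the variance lies in a low-dimensional subspace, two components recover nearly all of it, and the ratios always sum to at most 1. A hedged sketch on synthetic 5-feature data (not the bank's data):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Synthetic 5-feature data: most variance lives in the first two directions
base = rng.normal(size=(200, 2))
X = np.hstack([base, 0.1 * rng.normal(size=(200, 3))])

pca = PCA(n_components=2, random_state=1)
pca.fit(X)

# Two components capture nearly all the variance of this toy data
ratio = pca.explained_variance_ratio_.sum()
print(ratio > 0.9, ratio <= 1.0)  # → True True
```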

K-means Cluster Profiling¶

In [37]:
#####
# Code to visualise the Boxplots of numerical variables for each cluster obtained using K-means Clustering
#####

fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle(
    "Boxplot of numerical variables for each cluster obtained using K-means Clustering",
    fontsize=20,
)

for ii in range(5):
    sns.boxplot(ax=axes[ii], y=df[df.columns[ii]], x=df["K_means_segments"])

fig.tight_layout(pad=2.0)
In [38]:
#####
# Code to map each Cluster to a Profile
# C1=Cluster 1, C2=Cluster 2, C3=Cluster 3
#####


def cluster_map(val):

    if val == 2:

        return "C1-OnlineBankers - Customers with: High Credit Limit & High Total Credit Cards"

    elif val == 1:

        return "C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards"

    else:

        return "C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards"
In [39]:
# Apply the Profiles to the Cluster Segments

df["K_means_Group"] = df["K_means_segments"].apply(cluster_map)

pd.set_option('display.max_colwidth', None)

df.groupby("K_means_Group").mean().round(2).reset_index().sort_values(by=["Total_Credit_Cards"], ascending=False)
Out[39]:
K_means_Group Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments
0 C1-OnlineBankers - Customers with: High Credit Limit & High Total Credit Cards 141040.000 8.740 0.600 10.900 1.080 2.000
1 C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards 33782.380 5.520 3.490 0.980 2.000 1.000
2 C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards 12174.110 2.410 0.930 3.550 6.870 0.000

Observations

  • The Clusters have been assigned to Customer Profiles:

    ◎ Cluster 1:OnlineBankers: Customers with High Credit Limit, High Total Credit Cards, High Online Banking, Low Telephone Banking and Low BanksVisit.

    ◎ Cluster 2:TelephoneBankers: Customers with Medium Credit Limit, Medium Total Credit Cards, Low Online Banking, High Telephone Banking and Low BanksVisit.

    ◎ Cluster 3:PhysicalBankers: Customers with Low Credit Limit, Low Total Credit Cards, Low Online Banking, Low Telephone Banking and High BanksVisit.

Hierarchical Clustering¶

We will now run the Hierarchical Clustering algorithm on the dataset and observe its performance and clustering.¶

Cophenetic Coefficient¶

In [40]:
# Calculate the Cophenetic Coefficient for various pairs of distance metric and linkage methods.

# list of distance metrics
distance_metrics = ["euclidean", "mahalanobis", "cityblock", "cosine", "correlation"]

# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted"]

high_cophenet_corr = 0  # ---Stores the highest cophenetic coefficient

high_dm_lm = [0, 0]

for dm in distance_metrics:

    print("-" * 50)

    for lm in linkage_methods:

        # Calculate the linkage matrix used to build the dendrogram
        Z = linkage(hc_df, metric=dm, method=lm)

        c, coph_dists = cophenet(Z, pdist(hc_df))

        print(
            "Cophenetic correlation for {} distance and {} linkage is {}.".format(
                dm.capitalize(), lm.capitalize(), round(c, 3)
            )
        )

        if high_cophenet_corr < c:
            high_cophenet_corr = c
            high_dm_lm[0] = dm
            high_dm_lm[1] = lm
--------------------------------------------------
Cophenetic correlation for Euclidean distance and Single linkage is 0.739.
Cophenetic correlation for Euclidean distance and Complete linkage is 0.86.
Cophenetic correlation for Euclidean distance and Average linkage is 0.898.
Cophenetic correlation for Euclidean distance and Weighted linkage is 0.886.
--------------------------------------------------
Cophenetic correlation for Mahalanobis distance and Single linkage is 0.706.
Cophenetic correlation for Mahalanobis distance and Complete linkage is 0.666.
Cophenetic correlation for Mahalanobis distance and Average linkage is 0.833.
Cophenetic correlation for Mahalanobis distance and Weighted linkage is 0.781.
--------------------------------------------------
Cophenetic correlation for Cityblock distance and Single linkage is 0.725.
Cophenetic correlation for Cityblock distance and Complete linkage is 0.873.
Cophenetic correlation for Cityblock distance and Average linkage is 0.896.
Cophenetic correlation for Cityblock distance and Weighted linkage is 0.883.
--------------------------------------------------
Cophenetic correlation for Cosine distance and Single linkage is 0.747.
Cophenetic correlation for Cosine distance and Complete linkage is 0.532.
Cophenetic correlation for Cosine distance and Average linkage is 0.672.
Cophenetic correlation for Cosine distance and Weighted linkage is 0.566.
--------------------------------------------------
Cophenetic correlation for Correlation distance and Single linkage is 0.61.
Cophenetic correlation for Correlation distance and Complete linkage is 0.489.
Cophenetic correlation for Correlation distance and Average linkage is 0.66.
Cophenetic correlation for Correlation distance and Weighted linkage is 0.532.
In [41]:
# printing the combination of distance metric and linkage method with the highest cophenetic correlation
print(
    "Highest cophenetic correlation is {}, which is obtained with {} distance and {} linkage.".format(
        high_cophenet_corr, high_dm_lm[0].capitalize(), high_dm_lm[1].capitalize()
    )
)
Highest cophenetic correlation is 0.8977080867389372, which is obtained with Euclidean distance and Average linkage.
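As a self-contained illustration of what the loop above measures: the cophenetic correlation compares the original pairwise distances with the merge heights implied by the dendrogram, so well-separated data should yield a coefficient close to 1. A minimal sketch on synthetic data (toy values, not the bank dataset):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
# Two tight, well-separated blobs: the tree should preserve distances well
X = np.vstack([rng.normal(0, 0.5, size=(10, 2)), rng.normal(5, 0.5, size=(10, 2))])

Z = linkage(X, metric="euclidean", method="average")
c, _ = cophenet(Z, pdist(X))  # correlation between pdist(X) and cophenetic distances
```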

Exploring different linkage methods with Euclidean distance only.

In [42]:
# Calculate the Cophenetic Coefficient for Euclidean distance and various linkage methods.

# list of linkage methods
linkage_methods = [
    "single",
    "complete",
    "average",
    "weighted",
    "ward",  # Ward only works with Euclidean distance
]

high_cophenet_corr = 0
high_dm_lm = [0, 0]

for lm in linkage_methods:
    Z = linkage(hc_df, metric="euclidean", method=lm)
    c, coph_dists = cophenet(Z, pdist(hc_df))
    print(
        "Cophenetic correlation for Euclidean distance & {} linkage is {}.".format(
            lm.capitalize(), c
        )
    )
    if high_cophenet_corr < c:
        high_cophenet_corr = c
        high_dm_lm[0] = "euclidean"
        high_dm_lm[1] = lm
Cophenetic correlation for Euclidean distance & Single linkage is 0.7391220243806552.
Cophenetic correlation for Euclidean distance & Complete linkage is 0.8599730607972423.
Cophenetic correlation for Euclidean distance & Average linkage is 0.8977080867389372.
Cophenetic correlation for Euclidean distance & Weighted linkage is 0.8861746814895477.
Cophenetic correlation for Euclidean distance & Ward linkage is 0.7415156284827493.
In [43]:
# Printing the linkage method with the highest cophenetic correlation
print()
print(
    "Highest cophenetic correlation for Euclidean distance is {}, which is obtained with {} linkage.".format(
        high_cophenet_corr, high_dm_lm[1].capitalize()
    )
)
Highest cophenetic correlation for Euclidean distance is 0.8977080867389372, which is obtained with Average linkage.

Observations

  • Euclidean distance and Average linkage gave us the highest cophenetic coefficient.

Checking Dendrograms¶

Exploring the dendrograms for Euclidean distance with the different linkage methods.

In [44]:
# list of linkage methods
linkage_methods = ["single", "complete", "average", "weighted", "ward"]

results_cophenetic_coef = []

for i in linkage_methods:

    plt.figure(figsize=(20, 8))

    plt.ylabel("Dendrogram Distance")

    Z = linkage(hc_df, i, metric="euclidean")

    cc, cophn_dist = cophenet(Z, pdist(hc_df))

    dendrogram(
        Z,
        leaf_rotation=90.0,
        p=7,
        leaf_font_size=10,
        truncate_mode="level",
        color_threshold=60,
        link_color_func=lambda k: "purple",
    )

    plt.tight_layout()

    plt.title(
        "Linkage Type: " + i + " having cophenetic coefficient : " + str(round(cc, 3)),
        fontsize=20,
    )

    plt.show()
    results_cophenetic_coef.append((i, cc))
    print(i, cc)
single 0.7391220243806552
complete 0.8599730607972423
average 0.8977080867389372
weighted 0.8861746814895477
ward 0.7415156284827493
In [45]:
# Create and display a dataframe to compare cophenetic correlations for different linkage methods for euclidean distance
results_cophenetic_coef_df = pd.DataFrame(
    results_cophenetic_coef, columns=["LinkageMethod", "CopheneticCoefficient"]
)
results_cophenetic_coef_df
Out[45]:
LinkageMethod CopheneticCoefficient
0 single 0.739
1 complete 0.860
2 average 0.898
3 weighted 0.886
4 ward 0.742
In [46]:
#####
# Code to calculate the Silhouette Scores
#####

sil_score = []

for i in range(2, 11):

    model = AgglomerativeClustering(
        n_clusters=i, affinity="euclidean", linkage="average"
    )

    model.fit(hc_df)

    df["HC_segments"] = model.labels_

    print("~" * 40)

    sc = round(silhouette_score(hc_df, df["HC_segments"]), 3)

    sil_score.append(sc)

    print("Silhouette Score for K=", i, ":", sc)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 2 : 0.57
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 3 : 0.516
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 4 : 0.475
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 5 : 0.44
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 6 : 0.415
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 7 : 0.418
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 8 : 0.343
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 9 : 0.342
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Silhouette Score for K= 10 : 0.346
In [47]:
plt.figure(figsize=(10, 5))
plt.plot(range(2, 11, 1), sil_score, marker="*")
plt.title("Silhouette Score across different values of K")

plt.axvline(3, linestyle="--")

plt.xlabel("HC - clusters")
plt.ylabel("Silhouette Score")
plt.xticks(range(2, 11, 1))
plt.show()

Observations

  • The Silhouette Scores are highest at K = 3 (apart from K = 2).
  • This is in line with our findings from the EDA and K-means clustering.
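The k-selection loop above can be condensed into a small self-contained sketch (synthetic data and fit_predict rather than the notebook's dataframes; an assumption-laden toy, not the bank data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(7)
# Three well-separated blobs, so the silhouette score should peak at k = 3
X = np.vstack([rng.normal(c, 0.4, size=(20, 2)) for c in (0, 5, 10)])

scores = {}
for k in range(2, 6):
    labels = AgglomerativeClustering(n_clusters=k, linkage="average").fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)  # k with the highest silhouette score
```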

Creating the dendrograms for Euclidean distance with Average linkage.

In [48]:
#######
# We have identified the linkage and distance metric as euclidean and average
# Create dendrogram using these parameters at K = 3
#######

Z = linkage(hc_df, metric="euclidean", method="average")

plt.figure(figsize=(15, 10))
plt.title("Agglomerative Hierarchical Clustering Dendrogram")
plt.xlabel("sample index")
plt.ylabel("Dendrogram Distance")

dendrogram(
    Z,
    leaf_rotation=90.0,
    p=5,
    leaf_font_size=10,
    truncate_mode="level",
    color_threshold=3,
    link_color_func=lambda k: "purple",
)

# Threshold: we cut the tree at dendrogram distance of 3.2
plt.axhline(y=3.2, color="red")

plt.tight_layout()

Observations

  • We cut the tree at dendrogram distance of 3.2.
  • At this Threshold, we get 3 Clusters (sub-trees).
  • This is in line with our findings from the EDA and K-means clustering.
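Cutting the tree at a chosen height can also be done programmatically with scipy's fcluster, which returns flat cluster labels for a distance threshold. A minimal sketch on synthetic data (toy values and a toy threshold, not the notebook's 3.2 cut):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Three tight blobs along a line; within-blob merges happen well below height 2
X = np.vstack([rng.normal(c, 0.3, size=(15, 2)) for c in (0, 4, 8)])

Z = linkage(X, metric="euclidean", method="average")
labels = fcluster(Z, t=2.0, criterion="distance")  # cut the dendrogram at height 2.0
n_clusters = len(np.unique(labels))
```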

We Will Proceed with 3 Terminal Nodes for Hierarchical Clustering¶

Creating Final Model for Hierarchical Clustering¶

In [49]:
%%time
HCmodel = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="average")
HCmodel.fit(hc_df)
CPU times: user 6.44 ms, sys: 860 µs, total: 7.3 ms
Wall time: 6.14 ms
Out[49]:
AgglomerativeClustering(linkage='average', n_clusters=3)
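A version note: the affinity parameter used above was deprecated in scikit-learn 1.2 and removed in 1.4 in favor of metric. A version-tolerant sketch on hypothetical points (not the bank data):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.0],
              [5.0, 5.0], [5.1, 5.0],
              [10.0, 0.0], [10.1, 0.0]])

try:
    # scikit-learn >= 1.2
    model = AgglomerativeClustering(n_clusters=3, metric="euclidean", linkage="average")
except TypeError:
    # older scikit-learn versions use `affinity` instead of `metric`
    model = AgglomerativeClustering(n_clusters=3, affinity="euclidean", linkage="average")

labels = model.fit_predict(X)
```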
In [50]:
# Add the cluster labels to the dataframes

df["HC_segments"] = HCmodel.labels_
df2["HC_segments"] = HCmodel.labels_
hc_df["HC_segments"] = HCmodel.labels_
In [51]:
display(hc_df.head())
display(df.head())
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made HC_segments
0 1.740 -1.249 -0.860 -0.547 -1.252 0
1 0.410 -0.788 -1.474 2.521 1.892 2
2 0.410 1.059 -0.860 0.134 0.146 0
3 -0.122 0.136 -0.860 -0.547 0.146 0
4 1.740 0.597 -1.474 3.202 -0.204 1
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments K_means_Group HC_segments
0 100000 2 1 1 0 1 C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards 0
1 50000 3 0 10 9 0 C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards 2
2 50000 7 1 3 4 1 C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards 0
3 30000 5 1 1 4 1 C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards 0
4 100000 6 0 12 3 2 C1-OnlineBankers - Customers with: High Credit Limit & High Total Credit Cards 1

Hierarchical Cluster Profiling¶

In [52]:
#####
# Code to visualise the Boxplots of numerical variables for each cluster obtained using Hierarchical Clustering
#####

fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle(
    "Boxplot of numerical variables for each cluster obtained using Hierarchical Clustering",
    fontsize=20,
)

for ii in range(5):
    sns.boxplot(ax=axes[ii], y=df[df.columns[ii]], x=df["HC_segments"])


fig.tight_layout(pad=2.0)

Observations

  • The Cluster segmentation for Hierarchical Clustering (HC) appears to be the same as that of K-means Clustering (KM) but with different labels.
  • HC Cluster 1 is KM Cluster 2, HC Cluster 0 is KM Cluster 1, HC Cluster 2 is KM Cluster 0.
  • We will re-label the HC Clusters to match up with K-means, since they are essentially the same.
In [53]:
# Map the HC number groups to match that of KM number groups
df["HC_segments"] = df["HC_segments"].map({0: 1, 1: 2, 2: 0})
hc_df["HC_segments"] = hc_df["HC_segments"].map({0: 1, 1: 2, 2: 0})
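Relabeling works here because the two partitions coincide, but label numbers from separate clustering runs are arbitrary in general; a label-invariant agreement measure such as the adjusted Rand index avoids manual remapping. A sketch with small hypothetical label arrays:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

km_labels = np.array([0, 0, 1, 1, 2, 2, 0])
hc_labels = np.array([1, 1, 0, 0, 2, 2, 0])  # same partition except the last point

# ARI = 1 for identical partitions (up to relabeling), ~0 for random agreement
ari = adjusted_rand_score(km_labels, hc_labels)
```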
In [54]:
df["HC_Group"] = df["HC_segments"].apply(cluster_map)

pd.set_option("display.max_colwidth", None)

df.groupby("HC_Group").mean().round(2).reset_index().sort_values(
    by=["Total_Credit_Cards"], ascending=False
)
Out[54]:
HC_Group Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments HC_segments
0 C1-OnlineBankers - Customers with: High Credit Limit & High Total Credit Cards 141040.000 8.740 0.600 10.900 1.080 2.000 2.000
1 C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards 33713.180 5.510 3.490 0.980 2.010 1.000 1.000
2 C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards 12197.310 2.400 0.930 3.560 6.880 0.000 0.000

Cluster Profiling and Comparison¶

Cluster Profiling: K-means Clustering¶

In [55]:
km_cluster_profile = df.groupby("K_means_segments").mean()
In [56]:
km_cluster_profile["Count_in_Each_Segment"] = (
    df.groupby("K_means_segments")["Avg_Credit_Limit"].count().values
)

# Exclude the HC_segments
column_to_exclude = "HC_segments"
km_cluster_profile = km_cluster_profile.loc[
    :, km_cluster_profile.columns != column_to_exclude
]
In [57]:
km_cluster_profile.style.highlight_max(color="lightgreen", axis=0).highlight_min(
    color="lightyellow"
)
Out[57]:
  Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Count_in_Each_Segment
K_means_segments            
0 12174.107143 2.410714 0.933036 3.553571 6.870536 224
1 33782.383420 5.515544 3.489637 0.981865 2.000000 386
2 141040.000000 8.740000 0.600000 10.900000 1.080000 50

Cluster Profiling: Hierarchical Clustering¶

In [58]:
hc_cluster_profile = df.groupby("HC_segments").mean()
In [59]:
hc_cluster_profile["Count_in_Each_Segment"] = (
    df.groupby("HC_segments")["Avg_Credit_Limit"].count().values
)

# Exclude the HC_segments
column_to_exclude = "K_means_segments"
hc_cluster_profile = hc_cluster_profile.loc[
    :, hc_cluster_profile.columns != column_to_exclude
]
In [60]:
hc_cluster_profile.style.highlight_max(color="lightgreen", axis=0).highlight_min(
    color="lightyellow"
)
Out[60]:
  Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made Count_in_Each_Segment
HC_segments            
0 12197.309417 2.403587 0.928251 3.560538 6.883408 223
1 33713.178295 5.511628 3.485788 0.984496 2.005168 387
2 141040.000000 8.740000 0.600000 10.900000 1.080000 50

Observations

  • Both K-means and Hierarchical Clustering produce practically the same segments.
  • Segment 0 of HC appears to be missing one observation compared with segment 0 of KM, so a single observation seems to have been "misclassified" by the HC algorithm. We will take a closer look.

Finding the "misclassified" record¶

In [61]:
# Find the rows where the K-means and HC segment labels disagree
unequal_rows = df[df["K_means_segments"] != df["HC_segments"]]
unequal_rows
Out[61]:
Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments K_means_Group HC_segments HC_Group
313 7000 4 2 2 4 0 C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards 1 C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards
In [62]:
record = data.iloc[313]
record
Out[62]:
Sl_No                    314
Customer Key           72156
Avg_Credit_Limit        7000
Total_Credit_Cards         4
Total_visits_bank          2
Total_visits_online        2
Total_calls_made           4
Name: 313, dtype: int64

Observations

  • The "misclassified" observation is the Customer with Sl_No 314 and Customer Key 72156.
  • The "misclassified" observation has borderline values, so the disagreement is understandable. We will ignore it and proceed.

Comparison Continued...¶

Creating Boxplots to understand the customer distribution among the clusters.

In [63]:
#####
# Boxplots to understand the customer distribution among the clusters for K_means Clustering.
#####
fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle(
    "Boxplot of numerical variables for each cluster obtained using K-means Clustering",
    fontsize=20,
)

for ii in range(5):
    sns.boxplot(ax=axes[ii], y=df[df.columns[ii]], x=df["K_means_segments"])

fig.tight_layout(pad=2.0)
In [64]:
#####
# Boxplots to understand the customer distribution among the clusters for Hierarchical Clustering.
#####
fig, axes = plt.subplots(1, 5, figsize=(16, 6))
fig.suptitle(
    "Boxplot of numerical variables for each cluster obtained using Hierarchical Clustering",
    fontsize=20,
)

for ii in range(5):
    sns.boxplot(ax=axes[ii], y=df[df.columns[ii]], x=df["HC_segments"])

fig.tight_layout(pad=2.0)

Creating Barplots to understand the customer distribution among the clusters.

In [65]:
# Bar plot for the mean values along the segments for K_means

km_df.groupby("K_means_segments").mean().plot.bar(figsize=(15, 6))
Out[65]:
<AxesSubplot:xlabel='K_means_segments'>
In [66]:
# Bar plot for the mean values along the segments for HC

hc_df.groupby("HC_segments").mean().plot.bar(figsize=(15, 6))
Out[66]:
<AxesSubplot:xlabel='HC_segments'>

Creating Crosstab to understand the customer distribution among the clusters.

In [67]:
#####
# Comparing output of both algorithms
#####

pd.crosstab(df["HC_Group"], df["K_means_Group"])
Out[67]:
K_means_Group C1-OnlineBankers - Customers with: High Credit Limit & High Total Credit Cards C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards
HC_Group
C1-OnlineBankers - Customers with: High Credit Limit & High Total Credit Cards 50 0 0
C2-TelephoneBankers - Customers with: Moderate Credit Limit & Moderate Total Credit Cards 0 386 1
C3-PhysicalBankers - Customers with: Low Credit Limit & Low Total Credit Cards 0 0 223

Observations

  • The plots and crosstab show that the two algorithms produced practically the same results.

Checking the Clusters of the Duplicated Values of Customer Key¶

In [68]:
# getting the count for each unique value in Customer_Key
filtered = df2.groupby("Customer Key").count()

for i in filtered.loc[filtered.Sl_No >= 2].index:
    display(df2.loc[df2["Customer Key"] == i])
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments HC_segments
48 49 37252 6000 4 0 2 8 0 2
432 433 37252 59000 6 2 1 2 1 0
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments HC_segments
4 5 47437 100000 6 0 12 3 2 1
332 333 47437 17000 7 3 1 0 1 0
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments HC_segments
411 412 50706 44000 4 5 0 2 1 0
541 542 50706 60000 7 5 2 2 1 0
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments HC_segments
391 392 96929 13000 4 5 0 0 1 0
398 399 96929 67000 6 2 2 2 1 0
Sl_No Customer Key Avg_Credit_Limit Total_Credit_Cards Total_visits_bank Total_visits_online Total_calls_made K_means_segments HC_segments
104 105 97935 17000 2 1 2 10 0 2
632 633 97935 187000 7 1 7 0 2 1

Observations

  • It can be observed that three (3) of the five (5) customers with duplicated Customer Keys fall into different clusters/segments across their records.

  • This suggests that by converting customers to digital banking or by offering credit limit increases, we can move customers into a more desirable & profitable cluster.
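As an aside, the duplicate lookup above (groupby + count) can also be written with pandas' duplicated; a minimal sketch on a toy frame with hypothetical values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Customer Key": [111, 222, 222, 333, 111],
    "Avg_Credit_Limit": [10000, 20000, 21000, 30000, 95000],
})

# keep=False marks every row whose key appears more than once
dups = toy[toy["Customer Key"].duplicated(keep=False)].sort_values("Customer Key")
```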

Summary: K-means Clustering vs Hierarchical Clustering¶

Comparing the K-means Clustering and Hierarchical Clustering Algorithms, we have the following findings (in addition to all the observations above):

  • Comparing execution time: surprisingly, HC ran faster than K-means. K-means had a total CPU time of 127 ms and a wall time of 27.8 ms, while the HC algorithm had a total CPU time of 6.98 ms and a wall time of 5.65 ms.
  • Three (3) clusters are obtained as the appropriate number of clusters from both algorithms.
  • Both clustering techniques produced distinct, well-separated segments.
  • K-means has 224 observations in Group 0, 386 in Group 1 and 50 in Group 2; HC has 223 observations in Group 0, 387 in Group 1 and 50 in Group 2.

Business Insights and Recommendations¶

Actionable Insights and Recommendations¶

  • Based on both Clustering Techniques & Exploratory Data Analysis, there are three (3) distinct Customer Profiles:

    1. Profile 1 (OnlineBankers): Customers with a high credit limit, the most credit cards, heavy online banking, and few branch visits or calls.

    2. Profile 2 (TelephoneBankers): Customers with a moderate credit limit and a moderate number of credit cards, low online banking, and few calls; per the group means, this segment makes the most in-person branch visits.

    3. Profile 3 (PhysicalBankers): Customers with a low credit limit and the fewest credit cards, low online banking, and few branch visits; per the group means, this segment makes the most calls to the bank.

  • AllLife Bank should curate marketing campaigns...

    1. that are targeted at customers and prospects in the 3 segments, thereby personalizing marketing campaigns for both new and existing customers.
    2. that nudge customers toward digital banking (OnlineBankers & TelephoneBankers), thereby moving them into a more desirable & profitable cluster.
    3. that offer credit limit increases, which will also move customers into a more desirable & profitable cluster.

  • The bank should upgrade the service delivery model based on specific Customer Profiles by tailoring contact methods to these customer preferences. This will ensure that customer queries are resolved faster.


Further Recommendations:

  1. Conduct customer surveys: The Bank should gather feedback directly from customers through surveys or interviews to understand their preferences, needs, and pain points. This qualitative information can supplement the quantitative analysis and provide valuable insights for refining the segmentation strategy.

  2. Incorporate predictive modeling: Utilize predictive modeling techniques, such as machine learning algorithms, to anticipate customer behavior within each segment. Predictive models can help identify opportunities for cross-selling, upselling, or customer churn prevention, allowing for more personalized marketing strategies.

  3. Continuously evaluate and iterate: Customer segmentation is an ongoing process. Regularly evaluate the effectiveness of the segmentation strategy and make adjustments as needed. Stay agile and responsive to changing customer behaviors, market trends, and business goals.